fix: impute NaN before variable-selection steps in stats.py by drussellmrichie · Pull Request #313 · larsiusprime/openavmkit

drussellmrichie · 2026-03-31T15:58:00Z

Problem

sklearn (ElasticNet) and statsmodels (OLS/VIF) raise errors when input features contain NaN values. This is triggered when a dataset relies on LightGBM's native NaN handling (e.g. sparse binary indicators like has_garage, where NaN means "not recorded") and the variable-selection pre-pass runs before LightGBM training.

Errors seen:

ValueError: Input X contains NaN (ElasticNet / sklearn)
MissingDataError: exog contains inf or nans (statsmodels OLS)
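
For reference, the sklearn failure can be reproduced on toy data. This is a minimal sketch, not code from stats.py: the array values and the alpha setting are made up for illustration, and the exact wording of the error message may vary between sklearn versions.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

# Toy data (hypothetical): the second column stands in for a sparse
# indicator such as has_garage, where NaN means "not recorded".
X = np.array([[1.0, 1.0], [2.0, np.nan], [3.0, 0.0], [4.0, np.nan]])
y = np.array([10.0, 20.0, 30.0, 40.0])

try:
    ElasticNet(alpha=0.1).fit(X, y)
    raised = False
except ValueError as exc:
    # sklearn validates X before fitting and rejects NaN values
    raised = True
    print(f"ValueError: {exc}")
```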

Fix

Add median imputation of NaN to the top of each of the four variable-selection functions:

calc_elastic_net_regularization
calc_p_values_recursive_drop
calc_t_values_recursive_drop
calc_vif_recursive_drop

Imputation is scoped to these pre-passes only: LightGBM training still receives the real NaN values and handles them natively at each split. A UserWarning is emitted listing the affected columns so the user is aware. Median imputation is intended as a neutral choice for a variable-selection screen and is not expected to materially change which variables survive the screen.
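
The imputation step described above could look roughly like the following. This is a sketch under stated assumptions, not the code in this PR: the helper name impute_nan_with_median is hypothetical, the features are assumed to be a numeric pandas DataFrame, and the warning text is illustrative.

```python
import warnings

import numpy as np
import pandas as pd


def impute_nan_with_median(X: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: return X with NaN replaced by column medians.

    Assumes all columns are numeric. A column that is entirely NaN has a
    NaN median and is therefore left unchanged.
    """
    nan_cols = [c for c in X.columns if X[c].isna().any()]
    if not nan_cols:
        return X
    warnings.warn(
        f"Imputing NaN with column medians for variable selection only: {nan_cols}",
        UserWarning,
        stacklevel=2,
    )
    X = X.copy()  # leave the caller's frame (and its NaN values) untouched
    X[nan_cols] = X[nan_cols].fillna(X[nan_cols].median())
    return X


# Usage: the screen sees imputed values, the original frame keeps its NaN
df = pd.DataFrame(
    {"sqft": [1000.0, 1500.0, 2000.0], "has_garage": [1.0, np.nan, 0.0]}
)
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    screened = impute_nan_with_median(df)
```

Because the helper copies before filling, the frame later passed to LightGBM training would still carry its original NaN values.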

sklearn (ElasticNet) and statsmodels (OLS/VIF) raise errors when input features contain NaN values. This is triggered in practice when a dataset uses LightGBM's native NaN-handling (e.g. sparse binary indicators like "has_garage" or "has_fireplace" where NaN means "not recorded") and runs the variable-selection pre-pass before LightGBM training.

The fix adds median imputation of NaN to the top of each of the four variable-selection functions:

- calc_elastic_net_regularization
- calc_p_values_recursive_drop
- calc_t_values_recursive_drop
- calc_vif_recursive_drop

Imputation is scoped to these pre-passes only: LightGBM training still receives the real NaN values and handles them natively at each split. A UserWarning is emitted listing the affected columns so the user is aware.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

github-actions · 2026-03-31T15:58:10Z

Thank you for your contribution.
Please sign our CLA at the following link:
Click here to sign the CLA.
A maintainer will verify your signature and confirm it here by commenting with the following sentence:

I affirm that this contributor has signed the CLA

You can retrigger this bot by commenting recheck in this Pull Request. Posted by the CLA Assistant Lite bot.

drussellmrichie wants to merge 1 commit into larsiusprime:master from drussellmrichie:fix/stats-nan-imputation-variable-selection
